This is a slighlty adapted version of Neumann & Evert’s (2021)
reproduction script available from https://www.stephanie-evert.de/PUB/NeumannEvert2021/
(RMarkdown notebook analysis_proceedings.Rmd in ZIP archive
analysis_scripts.zip).
In particular, it uses the original support functions in
multivar_utils.R whereas our own replication study builds
on the new R package gmatools. We have made the following
changes to the reproduction script:
rgl is no
longer neededice_preprocessed.rda; in particular, our
ICE9 data set is reduced to the original three ICE3 componentsmvar_utils.R, which are
caught by more recent versions of R than the one used by Neumann &
Evert (2021)pdf_repro/ and can
easily be compared with our main reproduction/replication based on
gmatoolsLoad the preprocessed data set.
var.names <- load("ice_preprocessed.rda")
## Meta, rand.idx, Features, M, Z, ZL, types.variety, types.shortvar, types.mode, types.format, types.textcat32, types.short32, types.code32, types.textcat20, types.short20, types.code20, types.textcat12, types.short12, types.code12, rainbow.32, rainbow.20, rainbow.12, feature.names
All metadata variables are already coded as factors with a sensible
ordering of categories, so no further pre-processing is required here.
The data set also includes rainbow colours for text categories and
readable feature names. There are 7930 texts. See
prepare_data.Rmd for details about the distribution of
metadata categories and text lengths.
Since the original reproduction script expects a data set covering only the ICE3 components, we have to reduce data matrices, metadata, and the list of language variety types.
ice3.comp <- qw("icehk icejam icenz")
idx <- Meta$variety %in% types.variety[ice3.comp]
Meta <- droplevels(Meta[idx, ])
Features <- Features[idx, ]
M <- M[idx, ]
Z <- Z[idx, ]
ZL <- ZL[idx, ]
rand.idx <- rank(rand.idx[rand.idx %in% which(idx)])
types.variety <- types.variety[ice3.comp]
types.shortvar <- types.shortvar[ice3.comp]
A standard PCA based on z-scores reveals dimensions of register variation that correspond fairly well to the broad text categories in ICE.
PCA <- mvar.space(Z)
Z.pca <- mvar.projection(PCA, space="both")
mvar.pairs(Z.pca, 1:4, Meta=Meta, col=textcat20, pch=variety,
pch.vals=c(1, 3, 4), col.vals=rainbow.20,
cex=.6, legend.cex=.7, iso=TRUE, compact=TRUE)
save.pdf("pca4z_type.pdf")
The PCA based on log-transformed looks quite similar. Individual outliers are reduced and the main dimensions in the top left panel seem to show a little more structure. Therefore, we will exclusively use the log-transformed z-scores from now on.
PCA <- mvar.space(ZL)
ZL.pca <- mvar.projection(PCA, space="both")
mvar.pairs(ZL.pca, 1:4, Meta=Meta,
col=textcat20, col.vals=rainbow.20,
pch=variety, pch.vals=c(1, 3, 4),
cex=.6, legend.cex=.7, iso=TRUE, compact=TRUE)
save.pdf("pca4_type.pdf")
Do the main dimensions of variation also capture differences between the language varieties? A few regions occupied by a single variety likely correspond to discrepancies between text categories in the three corpora.
mvar.pairs(ZL.pca, 1:4, Meta=Meta,
col=variety, col.vals=bright.pal,
pch=variety, pch.vals=c(1, 3, 4),
cex=.6, legend.cex=.8, iso=TRUE, compact=TRUE)
save.pdf("pca4_var.pdf")
We now carry out an LDA by text category. Since there are 20 distinct categories, there will be a much larger number of discriminant dimensions.
lda.type <- mvar.discriminant(ZL, Meta$textcat20)
ByType <- mvar.space(ZL, lda.type, normalize=TRUE)
ByType.M <- mvar.projection(ByType, "both")
lda.type.P <- mvar.basis(ByType, "space")
mvar.pairs(ByType.M, 1:6, Meta=Meta,
col=textcat20, pch=variety, pch.vals=c(1, 3, 4), col.vals=rainbow.20,
cex=.6, legend.cex=.4, iso=TRUE, compact=TRUE)
save.pdf("lda6type_type.pdf", width=8, height=8)
There are 19 LDA dimensions, most of which capture interesting and substantial differences between text categories. An SVM classifier shows that together they separate the 20 text categories fairly well (with classification accuracy > 70%, though overtrained by the LDA).
res.type <- svm(ByType.M[, 1:19], Meta$textcat20, kernel="radial", cross=10)
svm.report <- function (res) {
acc <- mean(res$accuracies)
cat(sprintf("Mean accuracy: %.1f%%\n", acc))
cat("Cross-validation folds:\n")
print(round(res$accuracies, 1))
}
svm.report(res.type)
## Mean accuracy: 72.4%
## Cross-validation folds:
## [1] 67.7 76.7 72.1 72.4 70.0 71.6 75.6 71.0 71.0 75.6
However, this is not a useful perspective on the high-dimensional data set: it is impossible to grasp a 19-dimensional visualization intuitively and it would not provide a substantial reduction from the original feature space. The first 5 or 6 dimensions provide a discrimination accuracy well above 60%, and the first 4 are already close to 60%. We will therefore focus on LDA dims 1 to 4.
res.type <- svm(ByType.M[, 1:4], Meta$textcat20, kernel="radial", cross=10)
svm.report(res.type)
## Mean accuracy: 59.9%
## Cross-validation folds:
## [1] 59.2 57.6 56.9 59.4 58.3 64.9 59.4 64.0 58.7 61.1
In order to estimate how much information is lost by focusing on these dimensions, we compute pairwise discrimination quality between text categories. The only practicable way seems to be to find a single LDA discriminant dimension for each category pair and compute classification accuracy and Cohen \(d\):
discriminate.categories <- function (M, cats, digits=NULL) {
stopifnot(length(cats) == 2, all(cats %in% levels(Meta$textcat20)))
idx <- Meta$textcat20 %in% cats
y <- droplevels(Meta$textcat20[idx])
x <- M[idx, , drop=FALSE]
res <- predict(lda(x, y)) # LDA classification + dimension scores
acc <- 100 * sum(res$class == y) / length(y)
d <- abs(cohen.d(res$x[y == cats[1]], res$x[y == cats[2]]))
if (!is.null(digits)) {
acc <- round(acc, digits)
d <- round(d, digits)
}
data.frame(acc=acc, d=d, cat1=cats[1], cat2=cats[2], row.names=NULL,
stringsAsFactors=FALSE)
}
discriminate.pairwise <- function (M, cats, sort=FALSE, digits=NULL) {
cat.pairs <- combn(cats, 2, simplify=FALSE)
res <- lapply(cat.pairs, discriminate.categories, M=M, digits=digits)
res <- do.call(rbind, res)
if (sort) res <- res[order(res$d), ]
res
}
In the full LDA space, all pairs of text categories can bed discriminated fairly well, but with fewer dimensions some categories are collapsed. We obtain pairwise discrimination scores for different numbers of LDA dimensions and combine them into a single data frame. Interactively explore this table in order to find text categories that collapse due to our focus on 4 dimensions.
res <- discriminate.pairwise(ByType.M, types.textcat20, digits=2)
res.6 <- discriminate.pairwise(ByType.M[, 1:6], types.textcat20, digits=2)
res.4 <- discriminate.pairwise(ByType.M[, 1:4], types.textcat20, digits=2)
stopifnot(all.equal(res[, qw("cat1 cat2")], res.6[, qw("cat1 cat2")]),
all.equal(res[, qw("cat1 cat2")], res.4[, qw("cat1 cat2")]))
res$acc.6 <- res.6$acc; res$d.6 <- res.6$d
res$acc.4 <- res.4$acc; res$d.4 <- res.4$d
discrim.table <- res[order(res$d.4), ]
datatable(discrim.table, options=list(order=list(list(8, "asc"))),
caption = "Text category discrimination in full LDA space")
discrim.mat <- matrix(0, length(types.textcat20), ncol=length(types.textcat20),
dimnames=list(types.textcat20, types.textcat20))
with(discrim.table, {
discrim.mat[cbind(cat1, cat2)] <<- acc.4
discrim.mat[cbind(cat2, cat1)] <<- acc.4
})
discrim.pal <- heat_hcl(20, h=c(-20, 90), l=c(20,100))
heatmap.2(discrim.mat, zlim=c(0, 100), col=discrim.pal, margins=c(12, 12),
cexRow=1.2, cexCol=1.2, srtRow=30, srtCol=60, trace="none",
keysize=1, key.xlab="discrimination accuracy",
main="discrimination accuracy in 4 LDA dims")
save.pdf("lda4type_discrimination.pdf", width=8, height=8)
In an extension of our previous methodology, we now apply a rotation in the reduced target space so that interesting structure is better aligned with the subspace dimensions (instead of visually picking out an “axis” of interest). Since dims 1 and 2 are strongly correlated (Pearson \(r\) = 0.739496), a PCA-based rotation should align the diagonal axis with the first dimension. Note that a full PCA rotation in all four dimensions would lose too much of the discriminative structure brought out by the LDA (where the first dimension maximises the ratio of between-group and within-group variance). In an earlier version of the analysis we also flipped the two dimensions after rotation so that the largest variance is on the horizontal axis in the top-left panel of the scatterplot matrix and in 3D plots. However, it is much clearer to have the main dimension of variation as dimension 1.
ByType4 <- ByType %>% mvar.basis("space") %>% extract(, 1:4) %>% mvar.space(ZL, .)
ByType4 %<>% mvar.rotation("pca", dims=1:2) # %>% mvar.rotation("swap", dims=1:2)
ByType4.M <- mvar.projection(ByType4, "both")
ByType4.P <- mvar.basis(ByType4, "space")
This four-dimensional latent space is the basis for all further analysis and interpretation.
mvar.pairs(ByType4.M, 1:4, Meta=Meta,
col=textcat20, pch=variety, pch.vals=c(1, 3, 4), col.vals=rainbow.20,
cex=.6, legend.cex=.7, iso=TRUE, compact=TRUE)
save.pdf("lda4type_type.pdf", width=8, height=8)
There appear to be two overlapping “cigars” formed by the written and spoken texts. A scatterplot matrix colour-coded for mode shows this clearly.
mvar.pairs(ByType4.M, 1:4, Meta=Meta,
col=mode, pch=variety, pch.vals=c(1, 3, 4), col.vals=bright.pal,
cex=.6, legend.cex=.8, iso=TRUE, compact=TRUE)
save.pdf("lda4type_mode.pdf", width=8, height=8)
We create a custom version of the scatterplot matrix for inclusion in the paper, showing only the top row of the scatterplot matrix separately for spoken and written texts. It is written directly to a PDF file to ensure proper font sizes and layout.
# cairo_pdf(file="pdf_repro/lda4type_for_paper.pdf", width=12, height=8)
pch.vec <- c(1, 3, 4)[Meta$variety]
col.vec <- rainbow.20[Meta$textcat20]
plot.panel <- function (d, idx, cex=1, # -> 3:4 aspect ratio
xlim=c(-2.05, 2.0), ylim=c(-3.1, 2.3)) {
plot(ByType4.M[idx, d], ByType4.M[idx, 1],
pch=pch.vec[idx], col=col.vec[idx],
xlim=xlim, ylim=ylim, cex=cex,
xlab="", ylab="", main="", xaxt="n", yaxt="n")
}
textcat.W <- unique(Meta$textcat20[Meta$mode == "written"])
idx.lW <- types.textcat20 %in% textcat.W
par(mfrow=c(2, 4), mar=c(0, 0, 0, 0)+.2)
idx.W <- Meta$mode == "written"
plot.panel(2, idx.W)
text(-2, -0.4, "Dim 1", cex=1.2, srt=90, font=2)
plot.panel(3, idx.W)
plot.panel(4, idx.W)
plot(0, 0, type="n", ann=FALSE, bty="n", xaxt="n", yaxt="n")
legend(0, 0, xjust=0.5, yjust=0.5, cex=1.3,
title="Written Texts", bty="n",
legend=types.textcat20[idx.lW],
fill=rainbow.20[idx.lW], border=rainbow.20[idx.lW])
idx.S <- Meta$mode == "spoken"
plot.panel(2, idx.S)
text(-2, -0.4, "Dim 1", cex=1.4, srt=90, font=2)
text(0, 2.3, "Dim 2", cex=1.4, font=2)
plot.panel(3, idx.S)
text(0, 2.3, "Dim 3", cex=1.4, font=2)
plot.panel(4, idx.S)
text(0, 2.3, "Dim 4", cex=1.4, font=2)
plot(0, 0, type="n", ann=FALSE, bty="n", xaxt="n", yaxt="n")
legend(0, 0, xjust=0.5, yjust=0.5, cex=1.3,
title="Spoken Texts", bty="n",
legend=types.textcat20[!idx.lW],
fill=rainbow.20[!idx.lW], border=rainbow.20[!idx.lW])
save.pdf("lda4type_for_paper.pdf", width=12, height=8)
# invisible(dev.off())
The original 32 text categories in the fine-grained ICE design schema were aggregated into 20 broader categories, which are more manageable in our visualisation-based approach. This seems to be corroborated by the fact that no differences between the finer sub-categories are visible in our four LDA dimensions: they are spread out and intermingled evenly across the broader category. However, another explanation is that our LDA – based on the 20-category scheme – has ignored differences between the subcategories, aiming to reduce variability within the broader category.
Here, we carry out an alternative LDA using the 32-category scheme as a “gold standard” and compare the first four latent dimensions. The dimensions are automatically reordered and flipped to best match those of the original LDA.
lda.type32 <- mvar.discriminant(ZL, Meta$textcat32)
ByType32 <- mvar.space(ZL, lda.type32[, 1:4], normalize=TRUE)
lda.type32.P <- mvar.basis(ByType32, "space") # original orthogonalised dims
ByType32 %<>% mvar.rotation("pca", dims=1:2) %>% mvar.rotation("match", basis=ByType4.P)
ByType32.M <- mvar.projection(ByType32, "both")
ByType32.P <- mvar.basis(ByType32, "space")
mvar.pairs(ByType32.M, 1:4, Meta=Meta,
col=textcat32, pch=variety, pch.vals=c(1, 3, 4), col.vals=rainbow.32,
cex=.6, legend.cex=.55, iso=TRUE, compact=TRUE)
save.pdf("lda4type32_type.pdf", width=8, height=8)
Overall the visualisation looks reassuringly similar. We do get better separation between the more fine-grained categories, but the overall shape remains the same. It seems safe thus to report our findings based on the 20-category scheme.
We compute similarity of the two LDA analyses as the (fractional) number of common dimensions (see SIGIL Unit #7 for details), which indicates a very good match between the two spaces.
mvar.similarity(ByType4.P, ByType32.P)
## [1] 3.873358
The expected \(R^2\) for projection between the subspaces is 94.0%. The vector of singular values shows that the two subspaces (almost) share three dimensions, with relatively high cosine similarity in the fourth dimension.
mvar.similarity(ByType4.P, ByType32.P, method="sigma")
## [1] 0.9994853 0.9968407 0.9929528 0.8840793
Verify that we indeed get the same result without matching the basis vectors beforehand:
mvar.similarity(lda.type.P[, 1:4], lda.type32.P, method="sigma")
## [1] 0.9994853 0.9968407 0.9929528 0.8840793
The standard approach in multidimensional analysis is to interpret the feature weights (or “factor loadings”) of each latent dimension directly. In our case, these weights are the coordinates of the orthogonal basis vectors of the LDA space. The barplot below visualizes only features \(i\) that have a substantial weight \(|p_{ij}| \geq .1\) in at least one dimension \(j\). Keep in mind that feature weights are relative within each basis vector (because \(\|\mathbf{p}_{\bullet j}\|_2 = 1\)); a discriminant characterised by consistently large values of many different features would assign relatively low weights to all of them.
ByType4.P <- mvar.basis(ByType4, "space")
idx.weights <- apply(abs(ByType4.P), 1, max) >= .1 # only show features with substantial weight
ggbar.weights(ByType4.P, feature.names=feature.names, names=paste("Dim", 1:4),
idx=idx.weights, ylim=c(-.75, .4))
save.pdf("lda4type_weights.pdf", width=12, height=8)
For comparison, we show the first two original LDA dimensions before PCA rotation (but only for the features selected above).
ggbar.weights(lda.type.P[, 1:2], feature.names=feature.names,
idx=idx.weights, ylim=c(-.75, .4),
names=c("Original LDA dim 1", "Original LDA dim 2"))
save.pdf("lda6type_weights.pdf", width=12, height=4.5)
For the paper, we create individual plots for each dimension with suitable margins and labelling. They are directly written to PDF files for optimal formatting.
dim2label <- c("LD1" = "Dim 1: conceptual speaking / conceptual writing",
"LD2" = "Dim 2: dialogic written / neutral",
"LD3" = "Dim 3: descriptive-narrative / instructive-regulative",
"LD4" = "Dim 4: neutral / online production")
for (d in seq_len(ncol(ByType4.P))) {
cairo_pdf(file=sprintf("pdf_repro/lda4type_weights_dim%d.pdf", d),
width=12, height=4)
p <- ByType4.P[, d, drop=FALSE] %>%
ggbar.weights(feature.names=feature.names, names=paste("Dim", d),# names=dim2label[d],
idx=idx.weights, ylim=c(-.75, .4), main=dim2label[d])
print(p + theme(axis.text.x=element_text(size=14)))
dev.off()
}
As we’ve argued before, especially for LDA-based dimensions, it is
more informative to visualise how each feature pushes the texts from
some category or group towards the positive or negative end of a
dimension. We use a wrapper around ggbox.features() to pick
out categories and plot accordingly.
ggbox.selected <- function (M, Meta, weights, cats,
variable="short20", colours=rainbow.20,
what="contribution", main="",
group.labels=FALSE, ...) {
stopifnot(all(cats %in% names(colours)),
all(cats %in% levels(Meta[[variable]])))
group.vec <- as.character(Meta[[variable]])
Meta$grouping <- factor(ifelse(group.vec %in% cats, group.vec, "other"),
levels=c("other", cats))
col.values <- c("#666666", colours[cats])
ggbox.features(M, Meta, what=what,
weights=weights, id.var="id",
group=grouping, group.palette=col.values,
feature.names=feature.names,
main=main, group.labels=group.labels, ...) +
theme(strip.text.x=element_text(angle=70, hjust=0.2, vjust=0.2))
}
ggbox.selected(ZL, Meta, ByType4.P[, 1], select=idx.weights,
cats=qw("conv,acad", sep=",\\s*"),
main=dim2label["LD1"])
save.pdf("lda4type_box_example1.pdf", width=12, height=8)
We can also plot other dimensions, or select subsets of texts (e.g. a particular variety). The code chunk below illustrates relevant parameters, focusing on the third LDA dimension.
ggbox.selected(ZL, Meta, ByType4.P[, 3],
cats=qw("conv,acad", sep=",\\s*"),
select=idx.weights, subset=(shortvar == "NZ"),
main=sprintf("%s - New Zealand", dim2label["LD3"]))
save.pdf(file="lda4type_box_example2.pdf", width=12, height=8)
ggbox.selected(ZL, Meta, ByType4.P[, 3],
cats=qw("acad,popSci,creat", sep=",\\s*"),
variable="short12", colours=rainbow.12,
select=idx.weights,
main=sprintf("%s", dim2label["LD3"]))
save.pdf(file="lda4type_box_example3.pdf", width=12, height=8)
Boxes can be formed based on arbitrary metadata variables, provided that we have a suitably labelled vector of colour codes. Here we create one bespoke box with modified formatting for inclusion in the paper:
ggbox.selected(ZL, Meta, ByType4.P[, 1], select=idx.weights,
cats=qw("conv,news", sep=",\\s*"),
main=dim2label["LD1"], ylim=c(-1.1, 1.0), group.labels=TRUE) +
theme(axis.text.x=element_text(angle=52, hjust=1), legend.position="none")
save.pdf("lda4type_box_figure4.pdf", width=12, height=9)